WHITE WINE QUALITY ANALYSIS by Priya Khanchandani

Introduction -

White wine is a wine whose colour can be straw-yellow, yellow-green, or yellow-gold. It is produced by the alcoholic fermentation of the non-coloured pulp of grapes, which may have a skin of any colour. Common tests include °Brix, pH, titratable acidity, residual sugar, free or available sulfur, total sulfur, volatile acidity and percent alcohol.

Loading data
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Univariate Plots Section

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

The graph above shows that most of the wines have a quality rating between 5 and 7. Only 5 wines received the highest rating (9) and only 20 received the lowest (3).

## 
##    low medium   high 
##    183   4535    180

Acids and pH value in wine

The acids in wine are an important component in both winemaking and the finished product of wine. They are present in both grapes and wine, having direct influences on the color, balance and taste of the wine as well as the growth and vitality of yeast during fermentation and protecting the wine from bacteria.

The histograms below show the distribution of fixed acidity, volatile acidity and citric acid across all the wines.

There are three types of acidity given in the dataset: fixed acidity, volatile acidity and citric acid. The three primary acids found in wine grapes are tartaric, malic and citric acids. Most of the acids involved with wine are fixed acids, with the notable exception of acetic acid, mostly found in vinegar, which is volatile and can contribute to the wine fault known as volatile acidity. Acetic acid in wine, often referred to as volatile acidity (VA) or vinegar taint, can be contributed by many wine spoilage yeasts and bacteria.

From the graphs and the summary we can see that fixed acidity ranges from 3.8 to 14.2 g/dm^3, with the peak of the distribution near 7 g/dm^3. Volatile acidity ranges from 0.08 to 1.1 g/dm^3, with the peak between 0.2 and 0.3 g/dm^3. The concentration of wines at low values is consistent with the idea that an excess of volatile acid can spoil the wine.

Citric acid is found only in very minute quantities in wine grapes. Inexpensive citric acid supplements can be used by winemakers in acidification to boost the wine’s total acidity. Citric acid is used less frequently than tartaric and malic acid because of the aggressive citric flavors it can add to the wine. The graph shows citric acid ranging between 0 and 1.66 g/dm^3, with very few wines having more than 0.6 g/dm^3.

A wine with too much acidity will taste excessively sour and sharp. A wine with too little acidity will taste flabby and flat, with less defined flavors. Hence we can see that most of the wines fall in the average range of acidity.

pH in Wine

The strength of acidity is measured according to pH, with most wines having a pH between 2.9 and 3.9. Generally, the lower the pH, the higher the acidity in the wine. However, there is no direct connection between total acidity and pH (it is possible to find wines with a pH that is high for wine and also a high acidity). Winemakers use pH as a way to measure ripeness in relation to acidity. Low pH wines will taste tart and crisp, while higher pH wines are more susceptible to bacterial growth. Most wines have a pH between 3 and 4; about 3.0 to 3.4 is desirable for white wines.
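Because pH is a base-10 logarithmic scale, a small change in pH corresponds to a large change in hydrogen ion concentration. The following is a minimal sketch of that arithmetic, written in Python rather than the R used to produce this report; the function name `hydrogen_ion_concentration` is hypothetical and the calculation treats pH as the negative log of concentration, ignoring activity corrections.

```python
def hydrogen_ion_concentration(ph: float) -> float:
    """Approximate hydrogen ion concentration (mol/L) for a given pH.

    Assumes pH = -log10([H+]); activity effects are ignored.
    """
    return 10.0 ** (-ph)


# Compare the two ends of the range described as desirable for white wines.
low_ph, high_ph = 3.0, 3.4
ratio = hydrogen_ion_concentration(low_ph) / hydrogen_ion_concentration(high_ph)
print(f"A wine at pH {low_ph} has about {ratio:.1f} times the hydrogen ion "
      f"concentration of a wine at pH {high_ph}")
```

So the two ends of the 3.0 to 3.4 range differ by a factor of roughly 2.5 in hydrogen ion concentration, even though the pH values look close.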

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The graph and the summary show that the middle half of the wines have a pH between 3.09 and 3.28, with a median of 3.18 and a mean of 3.188.

Sweetness of Wine

The subjective sweetness of a wine is determined by the interaction of several factors, including the amount of sugar in the wine, but also the relative levels of alcohol, acids, and tannins. Sugars and alcohol enhance a wine’s sweetness; acids (sourness) and bitter tannins counteract it.

Residual sugar

  1. Among the components influencing how sweet a wine will taste is residual sugar. It is usually measured in grams of sugar per litre of wine, often abbreviated to g/l or g/L.
  2. Residual sugar typically refers to the sugar remaining after fermentation stops, or is stopped, but it can also result from the addition of unfermented must (a technique practiced in Germany and known as Süssreserve) or ordinary table sugar.
  3. Even among the driest wines, it is rare to find wines with a level of less than 1 g/L, due to the unfermentability of certain types of sugars, such as pentose. By contrast, any wine with over 45 g/L would be considered sweet, though many of the great sweet wines have levels much higher than this.

  1. The distribution of residual sugar is positively skewed, with the largest concentration of wines between 0 and 5 g/dm^3 (the median is 5.2 g/dm^3), which suggests that most of the wines in the dataset are dry or medium dry.
  2. There are many outliers in the residual sugar graph, with a maximum value of 65.8 g/dm^3. As noted above, a wine with over 45 g/L would be considered sweet; the graph shows that very few wines exceed that level.

Alcohol

White wine is made from white or black grapes, but always grapes with white flesh (grapes with coloured flesh are called Teinturier, meaning coloured juice). Once harvested, the grapes are pressed and only the juice, called must, is extracted. The must is put into tanks for fermentation, where sugar is transformed into alcohol by yeast present on the grapes.

The dataset provides the percent alcohol content of each wine.

The histogram below shows the distribution of alcohol content across all the wines.

  1. The percentage of alcohol content in wine has a multimodal distribution with peaks at different values.
  2. Wines are commonly grouped by alcohol by volume (ABV) into the categories below.

Low Alcohol Wines - Under 10% ABV. Most of these wines will be light in body and sweet.

Medium-Low Alcohol Wines - Wines ranging from 10–11.5% ABV. These are usually produced when less-sweet grapes are used to make wine. There are also several sparkling wines in this category, because the producers pick the grapes a little earlier in the season to ensure that the wines stay zesty, with higher acidity to complement the bubbles.

Medium Alcohol Wines - Wines ranging from 11.5%–13.5% ABV.

Medium-High Alcohol Wines - Wines ranging from 13.5%–15% ABV. This is the typical range for dry American wines and for wines from other warm-climate growing regions, including Argentina, Australia, Spain and Southern Italy. Regions with warmer climates produce sweeter grapes, which in turn increases the potential alcohol content of the wine.

High Alcohol Wines - Wines over 15% ABV.

  3. From the graph above it can be seen that there are not many medium-high or high alcohol wines. Most of the wines in the dataset are low or medium-low in alcohol, with the peak in the low alcohol category. One possible explanation is that wine is often chosen for social drinking, where lower-alcohol drinks are a good option.
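The alcohol categories described above can be expressed as a simple lookup. This is a minimal sketch in Python (not the R used for the report); the function name `abv_category` is hypothetical, and how the exact boundary values (10, 11.5, 13.5 and 15) are assigned is an assumption, since the text does not say which side they belong to.

```python
def abv_category(abv: float) -> str:
    """Return the alcohol category for a wine with the given percent ABV.

    Boundary handling is an assumption: each boundary value is placed in
    the higher category, except 15.0, which stays in "medium-high".
    """
    if abv < 10.0:
        return "low"
    if abv < 11.5:
        return "medium-low"
    if abv < 13.5:
        return "medium"
    if abv <= 15.0:
        return "medium-high"
    return "high"


# Minimum, median and maximum alcohol values from the summary table.
for abv in (8.0, 10.4, 14.2):
    print(abv, abv_category(abv))
```

With these thresholds, even the strongest wine in the dataset (14.2%) falls in the medium-high category, so no wine reaches the high alcohol group.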

Sulphates and Sulphur dioxide in Wine

The first thing to understand about sulfites is that they bind with other things in wine. They bind with micro-organisms, oxygen, solids, yeast, acids, bacteria, and sugars. When this chemical bond happens the sulfite goes from being free to bound. Bound sulfite has already done its job and while it is still in wine it is not free to bind with anything else. Thus we have two different sulfite levels to worry about, free and total.

Free Sulfur Dioxide: A wine needs to be protected against many things that can spoil it. Protection comes only from free sulfites, which prevent microbial growth and the oxidation of wine.

Total Sulfur Dioxide: The amount of free and bound forms of SO2. In low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm it becomes evident in the nose and taste of the wine.

A winemaker therefore needs to know how much free sulfite is already in the wine and how much free sulfite is wanted. When sulfites are added to wine, usually in the form of potassium metabisulfite, some will become bound while the rest will remain free. One cannot predict how much will become bound, so winemakers add potassium metabisulfite, test the wine, then adjust as necessary.

The effectiveness of sulfites changes with the pH of the wine. The higher the pH, the more sulfite is needed to do the same job as in a wine with a lower pH. The maximum allowable doses depend on the sugar content of the wine, because residual sugar is susceptible to attack by microorganisms, which would cause fermentation to restart.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
  1. Both free and total sulphur dioxide have approximately normal distributions.
  2. Comparing the two, total SO2 is more than double free SO2 (the means are 138.4 and 35.31, a ratio of nearly four), which shows that these wines typically contain more bound SO2 than free SO2.
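Since total SO2 is the sum of the free and bound forms, the bound amount can be recovered by subtraction. The sketch below does this in Python (not the R used for the report) for the first ten rows shown in the data structure listing near the top; the variable names are illustrative only.

```python
# First ten values of free.sulfur.dioxide and total.sulfur.dioxide,
# copied from the data structure listing earlier in this report.
free_so2 = [45, 14, 30, 47, 47, 30, 30, 45, 14, 28]
total_so2 = [170, 132, 97, 186, 186, 97, 136, 170, 132, 129]

# Bound SO2 is whatever part of the total is not free.
bound_so2 = [total - free for free, total in zip(free_so2, total_so2)]

for free, total, bound in zip(free_so2, total_so2, bound_so2):
    print(f"free={free:>3}  total={total:>3}  bound={bound:>3}")
```

In every one of these ten rows the bound amount is larger than the free amount, in line with the comparison of the full distributions.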

Chloride content in White Wine -

The chlorides variable measures the amount of salt in the wine.

Density:

The density of wine is close to that of water and depends on the percent alcohol and sugar content.

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations of wine with 12 variables (11 numeric physicochemical properties and one integer expert review); the loaded data frame also carries a row index X as a 13th column. Other observations:

  1. Most of the wines have a quality rating of 5, 6 or 7.
  2. Most of the wines have a pH between 2.80 and 3.47.
  3. The median alcohol content is 10.40%.
  4. The average residual sugar is 6.391 g/dm^3, with a maximum of 65.80 g/dm^3.

What is/are the main feature(s) of interest in your dataset?

I find all the variables important for analysing the dataset. As studied, every chemical property adds to a wine’s quality. However, I would like to focus more on acidity, sugar, alcohol and sulphates in wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

My interest is to analyse which chemical properties contribute to high quality wines and which levels of those properties are associated with low quality wines. I think the relationships between the properties could help define a wine’s quality and taste.

Did you create any new variables from existing variables in the dataset?

Yes, I categorized quality into ‘low’, ‘medium’ and ‘high’ levels. Wines rated 3 and 4 are low quality, wines rated 5, 6 and 7 are medium quality, and wines rated 8 and 9 are high quality.
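The grouping can be checked against the counts reported earlier. This is a minimal sketch in Python (the report itself was produced in R, and this is not the original code); the function name `quality_class` is hypothetical, and the per-rating counts are copied from the quality table in the univariate section.

```python
# Number of wines per quality rating, from the table in the univariate section.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}


def quality_class(rating: int) -> str:
    """Map a 3-9 quality rating to the low / medium / high level used here."""
    if rating <= 4:
        return "low"
    if rating <= 7:
        return "medium"
    return "high"


# Add up the wines that fall into each level.
class_counts = {"low": 0, "medium": 0, "high": 0}
for rating, count in quality_counts.items():
    class_counts[quality_class(rating)] += count

print(class_counts)  # {'low': 183, 'medium': 4535, 'high': 180}
```

The totals match the low, medium and high counts shown in the univariate section, and they make clear how unbalanced the three levels are.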

Of the features you investigated, were there any unusual distributions?

In this dataset every chemical property is approximately normally distributed, except residual sugar and alcohol. Residual sugar has a positively skewed distribution and alcohol has a multimodal distribution.
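One rough way to check for positive skew without plotting is to compare each variable's mean with its median, since a long right tail pulls the mean upward. The sketch below does this in Python (not the R used for the report) with the values from the summary table; the relative gap used here is an informal indicator chosen for illustration, not a formal skewness statistic.

```python
# (mean, median) pairs copied from the summary table in the univariate section.
summary = {
    "fixed.acidity": (6.855, 6.800),
    "volatile.acidity": (0.2782, 0.2600),
    "citric.acid": (0.3342, 0.3200),
    "residual.sugar": (6.391, 5.200),
    "chlorides": (0.04577, 0.04300),
    "free.sulfur.dioxide": (35.31, 34.00),
    "total.sulfur.dioxide": (138.4, 134.0),
    "density": (0.9940, 0.9937),
    "pH": (3.188, 3.180),
    "sulphates": (0.4898, 0.4700),
    "alcohol": (10.51, 10.40),
}

# Gap between mean and median, relative to the median.
relative_gap = {
    name: (mean - median) / median for name, (mean, median) in summary.items()
}

# List the variables from most to least right-skewed by this measure.
for name, gap in sorted(relative_gap.items(), key=lambda item: -item[1]):
    print(f"{name:<22} {gap:+.3f}")

most_skewed = max(relative_gap, key=relative_gap.get)
```

By this measure residual sugar stands out clearly, with a mean more than 20% above its median. The comparison cannot detect a multimodal shape such as the one seen for alcohol, which still needs the histogram.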

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Data is in tidy form and hence no changes have been made.

Bivariate Plots Section

Bivariate Analysis

Looking at the medians for low, medium and high quality wines, the fixed acidity of medium and high quality wines is slightly lower than that of low quality wines.

The volatile acidity of low quality wines is higher than that of medium and high quality wines.

There is no significant difference in the citric acid content of low, medium and high quality wines.

The boxplot shows that a larger share of high quality wines have a comparatively high pH value.

The graph shows that high quality wines are a little sweeter than medium and low quality wines.

High quality wines have distinctly higher alcohol, with an alcohol level above 11%.

The graph shows that wine quality does not depend on potassium sulphate content.

The boxplot above shows that free sulfur dioxide is much lower in low quality wines.

However, there is little difference in the total sulfur dioxide content of wines of different quality.

The graph above shows that high quality wines have less sodium chloride. High quality wines mostly do not have more than 0.04 g/dm^3 of sodium chloride, whereas low quality wines mostly have more than 0.038 g/dm^3.

Because outliers were affecting the plot, I limited the y axis in the density vs quality class graph. High quality wines have a density of no more than 0.9937 g/cm^3, i.e. high quality wines are of comparatively lower density.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in dataset?

  1. Alcohol and density have a correlation coefficient of -0.8, i.e. density decreases as alcohol increases, or higher alcohol wines have lower density.
  2. Alcohol and residual sugar have a correlation coefficient of -0.5, which could mean that wines with high residual sugar have low alcohol.
  3. Wines with high residual sugar have high density.
  4. Alcohol is negatively correlated with chlorides.
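The correlation coefficients quoted above are Pearson coefficients computed on the full dataset. As a minimal sketch of how such a coefficient is calculated, the Python below (not the R used for the report) applies the formula to only the first ten alcohol and residual sugar values shown in the data structure listing; the helper name `pearson` is hypothetical, and a ten-row sample is not expected to reproduce the full-data value of -0.5.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First ten rows of alcohol and residual.sugar from the data structure listing.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0]
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5]

r = pearson(alcohol, residual_sugar)
print(f"Pearson r on the ten-row sample: {r:.2f}")
```

Even on this small sample the coefficient comes out negative, the same direction as the relationship reported for the whole dataset.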

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The data summary and graphs show that high quality wines have a high alcohol level as well as high residual sugar. However, there is a negative correlation between alcohol and residual sugar. This is an interesting relationship to analyse.

Another interesting relationship I noticed is between sodium chloride and wine quality.

What was the strongest relationship you found?

The strongest relationships found are 1) between alcohol and density, 2) between residual sugar and density, and 3) between alcohol and sodium chloride.

Multivariate Plots Section

To analyze the impact of the chemical properties and their relationships in defining quality levels, I have subset the dataset to only the high and low quality wines.

The graph shows a negative correlation between alcohol and density. Low quality wines are high in density and low in alcohol, while high quality wines are high in alcohol and low in density.

The graph above does not show much of a relationship between alcohol and residual sugar. Low and high quality wines are spread almost equally across residual sugar levels.

There is a positive correlation between density and residual sugar: as residual sugar increases, density increases. However, among wines with the same residual sugar, those with higher density tend to be low quality and those with lower density tend to be high quality.

The graph above shows a negative correlation between alcohol and the chloride content of wines. High quality wines have high alcohol and comparatively little sodium chloride, whereas low quality wines have comparatively high sodium chloride and a lower alcohol level.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the graphs above it can be noticed that:

  1. There is a negative correlation between alcohol and density. Low quality wines are high in density and low in alcohol, whereas high quality wines are high in alcohol and low in density.
  2. There is not much of a relationship between alcohol and residual sugar. Low and high quality wines are spread almost equally in terms of residual sugar content.
  3. There is a positive correlation between density and residual sugar: as residual sugar increases, density increases. However, among wines with the same residual sugar, those with comparatively higher density are of low quality and those with lower density are of high quality.
  4. There is a negative correlation between alcohol and the chloride content of wines. High quality wines have high alcohol and comparatively little sodium chloride, whereas low quality wines have comparatively high sodium chloride and a lower alcohol level.

Were there any interesting or surprising interactions between features?

One interesting relationship I noticed is between alcohol level and Sodium Chloride content of wines. Wines with low sodium chloride have high alcohol level and are better in quality.


Final Plots and Summary

Plot One

Description One

In plot one I have used boxplots to show the alcohol level, density and sodium chloride content of low, medium and high quality wines. From the graph we can see that high quality wines have a comparatively high alcohol level, low density and low sodium chloride content.

Plot Two

Description Two

The graph above is a scatter plot of alcohol against density for low and high quality wines. Wines with a quality rating of 8 or 9 are high quality wines, shown as blue dots, and wines with a quality rating of 3 or 4 are low quality wines, shown as peach dots. The graph shows a negative correlation between alcohol and density. Wines with high alcohol have low density and are mostly high quality, whereas wines with a low alcohol level have high density and are mostly low quality.

Plot Three

Description Three

The graph above is a scatter plot of alcohol against sodium chloride for low and high quality wines. Wines with a quality rating of 8 or 9 are high quality wines, shown as blue dots, and wines with a quality rating of 3 or 4 are low quality wines, shown as peach dots. The graph shows a negative correlation between alcohol and sodium chloride. High quality wines mostly have a low sodium chloride content, and low quality wines have a comparatively high chloride content.

Reflection

From the analysis above, I found that the quality testers gave preference to wines with a comparatively high alcohol level. Although I did not initially expect sodium chloride to have any impact on quality, this analysis suggests that wines with high sodium chloride did not taste good to the quality testers. Another interesting fact about the physicochemical properties emerged from exploring the correlation of residual sugar, density and alcohol: sweeter wine has a higher density, and among wines of the same sweetness, those with more alcohol have a lower density.

Limitations - As the quality rating is provided by three testers for all wines, it would not be wise to select wines based only on this analysis. However, the analysis gives a good idea of which physicochemical properties to look for when selecting a wine.

Reference - http://winemakersacademy.com/potassium-metabisulfite-additions/